Over the past decade, with the growing influence of social media and the heightened popularity of global stars, interest in women’s sports has skyrocketed. However, even with high-profile campaigns for equity in sports [1], it has been difficult to cultivate an audience and build a market for women’s sports because they receive minimal media coverage compared to men’s sports [2]. Although major events such as the WNBA Finals are strong pulls for fans, the lack of nationally televised games and the limited marketing budgets for showcasing players throughout the year have been a barrier for sports fans, especially established NBA fans, to engage consistently with the WNBA even when they are interested. Furthermore, although statistics can fuel sports passion and storytelling, data and advanced statistics for the WNBA only recently became easily accessible to the public [3]. We therefore seek not only to promote sustained fan engagement and interaction with the WNBA, but also to provide better and more accessible player statistics.
Our project aims to make the following contributions, which will be displayed in a public-facing Shiny app:
- Develop archetypes of current WNBA players based on each player’s tendencies, abilities, and overall statistics
- Conduct archetype exploration on NBA players using the same variables to discover similarities and differences in the types of players in the two leagues
- Draw comparisons between WNBA players and NBA players
  - Each WNBA player who has played significant minutes/games will be matched with 3-5 similar NBA players based SOLELY on play style
Ultimately, we believe that labeling each WNBA player with an archetype and developing an NBA player comparison can boost year-round engagement, bring the WNBA into the spotlight, and keep it there for years to come.
To define player archetypes in the WNBA and compare WNBA players to NBA players, data must be gathered on a seasonal basis. Using Basketball Reference [4], player statistics were gathered dating back to 2018 (the first year WNBA play-by-play and shot location data became available). Relevant variables included:
This data allows both play style and effectiveness to be evaluated and considered when developing player archetypes and subsequently creating player comparisons between WNBA players and NBA players.
Cleaning the WNBA all stats dataset:
Cleaning the NBA all stats dataset:
Before modeling or clustering, the distributions of variables within the WNBA dataset were examined to better understand the relationships that are present:
Position Distribution:
Distribution of Minutes Per Game across WNBA Players:
Shot distance:
Field Goals:
Assist & Block percentage:
To choose which variables were most important in determining the archetypes, principal component analysis (PCA) was used to reduce the dimensionality of the feature space.
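The dimensionality reduction described above can be illustrated with a minimal sketch. It is not the project's code: the three feature columns and all numbers are made up, and the leading principal component is found by power iteration on the covariance matrix of standardized features so that the example needs only the Python standard library. In practice a library PCA routine would normally be used instead.

```python
import math

def standardize(rows):
    """Center each column to mean 0 and scale it to unit (population) standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(d)]
    return [[(r[j] - means[j]) / sds[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Covariance matrix of rows that are already centered."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(d)] for i in range(d)]

def mat_vec(matrix, vector):
    """Matrix-vector product for plain nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def first_component(cov, iters=500):
    """Power iteration: unit eigenvector and eigenvalue for the largest eigenvalue.

    Assumes the starting vector (first coordinate axis) is not orthogonal
    to the leading component.
    """
    d = len(cov)
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iters):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, eigenvalue

# Hypothetical per-player rows: [three-point attempt rate, assist %, block %]
players = [
    [0.45, 28.0, 1.0],
    [0.40, 25.0, 1.5],
    [0.35, 18.0, 2.5],
    [0.10, 9.0, 6.0],
    [0.05, 7.0, 7.5],
    [0.20, 12.0, 4.0],
]

cov = covariance(standardize(players))
loadings, eigenvalue = first_component(cov)
# Share of total variance captured by the first component
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print("PC1 loadings:", [round(x, 3) for x in loadings])
print("PC1 explained variance ratio:", round(explained, 3))
```

Variables with large absolute loadings on the leading components are the ones that contribute most to separating players, which is one way to decide which variables to carry forward into clustering.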
To allow for some uncertainty in the clustering results, a Gaussian Mixture Model (GMM) was used to yield soft assignments for clustering the players.
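To make "soft assignment" concrete, the sketch below computes the posterior probability that an observation belongs to each component of a mixture, which is the quantity a GMM reports in place of a single hard label. It is only an illustration under stated assumptions: the mixture is one-dimensional, and the weights, means, and standard deviations are hypothetical values rather than parameters fitted to any player data.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def soft_assign(x, weights, means, sds):
    """Posterior probability of each mixture component given x (Bayes' rule)."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-archetype mixture over one made-up feature
# (for example, a three-point attempt rate)
weights = [0.5, 0.5]
means = [0.15, 0.45]
sds = [0.08, 0.08]

for x in (0.15, 0.30, 0.45):
    probs = soft_assign(x, weights, means, sds)
    print(x, [round(p, 3) for p in probs])
```

A player near one component's mean receives a probability close to 1 for that archetype, while a player between the two means receives a split assignment. That split is the uncertainty a hard clustering method would discard.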
Before running a model to derive play style comparisons, variables related to play style (and not results) were selected. These included:
To develop a model that outputs an NBA comparison for a WNBA player’s play style, a Gaussian Mixture Model (GMM) was trained using the past 5 seasons of NBA data (2018-2022). In doing so, clusters of NBA players were created with corresponding probabilities for each player belonging to each cluster. WNBA player profiles consisting of the same variables were then fed into the model, similarly receiving probabilities of belonging to each cluster. To derive the NBA player most similar to a WNBA player, the Euclidean distance between the WNBA player’s cluster probabilities and every NBA player’s cluster probabilities was calculated. The NBA player with the smallest distance was selected as the comparison for the WNBA player of interest. A GMM was chosen over K-Means clustering to take advantage of soft assignments and the probabilities generated by a GMM.
These probabilities of an observation being assigned to each cluster were then used to compare players. This was done by computing the Euclidean distance between the cluster probabilities for each pair of observations and taking the ‘closest’ observation (minimum distance). This allowed a WNBA player to be fed into the model, the distances to be calculated, and the ‘closest’ NBA player to be declared the player comparison.
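The matching step can be sketched as follows. This is a hedged illustration rather than the project's implementation: the function name `closest_comps`, the player labels, and the five-cluster probability vectors are all hypothetical, and the code assumes the cluster probabilities have already been produced by a fitted mixture model.

```python
import math

def closest_comps(target_probs, candidates, k=3):
    """Return the k candidates whose cluster-probability vectors are nearest
    to target_probs in Euclidean distance, as (name, distance) pairs."""
    scored = [(name, math.dist(target_probs, probs)) for name, probs in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Made-up cluster probabilities over 5 clusters for placeholder NBA players
nba_probs = {
    "NBA Player A": [0.90, 0.05, 0.03, 0.01, 0.01],
    "NBA Player B": [0.10, 0.80, 0.05, 0.03, 0.02],
    "NBA Player C": [0.85, 0.10, 0.03, 0.01, 0.01],
    "NBA Player D": [0.02, 0.03, 0.05, 0.10, 0.80],
}

# Made-up cluster probabilities for one placeholder WNBA player
wnba_probs = [0.88, 0.07, 0.03, 0.01, 0.01]

comps = closest_comps(wnba_probs, nba_probs, k=3)
for name, distance in comps:
    print(f"{name}: {distance:.3f}")
```

Keeping the top `k` candidates instead of only the minimum yields a ranked list of comparisons with their distances for each player.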
5 clusters/archetypes:
Table of comparisons for 3 example players:
Sample of WNBA to NBA Comparisons

| WNBA Player | NBA Comp #1 | Distance 1 | NBA Comp #2 | Distance 2 | NBA Comp #3 | Distance 3 | NBA Comp #4 | Distance 4 |
|---|---|---|---|---|---|---|---|---|
| Kelsey Plum | Fred VanVleet | 0.005 | Anfernee Simons | 0.029 | Brandon Williams | 0.044 | Bones Hyland | 0.056 |
| Allie Quigley | Payton Pritchard | 0.068 | Kira Lewis Jr. | 0.220 | Jaylen Nowell | 0.231 | Davion Mitchell | 0.250 |
| Breanna Stewart | Karl-Anthony Towns | 0.008 | Josh Giddey | 0.015 | Brandon Ingram | 0.017 | Anthony Davis | 0.017 |
Initial Shiny app examples for the 3 players in the table above:
We would first like to express our gratitude to Carnegie Mellon’s Statistics & Data Science Department for providing us with a great opportunity to complete a project on sports analytics. In particular, this work would not have been possible without the valuable guidance and support of Dr. Ron Yurko, the lead instructor and director of CMSAC, and Maxsim Horowitz, senior data analyst for the Atlanta Hawks, who advised our project. We are also grateful to everyone with whom we have had the pleasure of working during this and other related projects, including our fellow students and teaching assistants.
[1] https://www.teamheroine.com/blog/the-10-best-womens-sport-campaigns-of-2020
[2] https://www.si.com/sports-illustrated/2021/03/24/womens-sports-gender-study-discrepancy
[3] https://niemanreports.org/articles/covering-womens-sports/
[4] https://www.basketball-reference.com/wnba/years/2022_per_game.html
Carnegie Mellon University, amorai@cmu.edu
St. Olaf College, noecke2@stolaf.edu
Harvard University, mhombergbertley@college.harvard.edu